(54) Abstract Title

Data processing apparatus 

(57) A data processing apparatus (400, 800) includes a
plurality of functional units (402-412, 802-812) (for instance,
ALUs, program counters, multipliers); each functional unit
performs a set of prescribed operations. An instruction
decoder functional unit (404, 804/814) decodes a current
instruction into functional unit signals so that the plurality
of functional units perform a corresponding instruction
task. A communication device (416, 820) couples the
functional units to one another. The apparatus may be an
asynchronous digital processor (800), in which case
self-timing and inter-block communication are used to
implement a self-timed scheduler. The self-timed
scheduler may include: a decoder controller (804); a
scheduler controller (814) which decodes each current
instruction to generate functional unit schedule and
control information; a communication device (820); and a
plurality of scheduler functional unit controllers (816). Each
of the plurality of functional units is associated with a
respective scheduler functional unit controller (816). The
scheduler controller (814) controls the operations of the
functional units and implements a set of instructions that
takes account of the data dependencies between, and the
variable execution times of, the functional units (802-812).
Furthermore, an entire instruction or instruction set may
be modified by a programmable circuit (1302, Fig. 13). The
modification may be implemented during initialization or
execution.

DATA PROCESSING APPARATUS

1. Field of the Invention

The present invention relates to a scheduler and method for a digital processor
used in data processing systems, and to a real-time user defined instruction set and
method for a digital processor.

2. Background of the Related Art

A processor such as a microprocessor, microcontroller or digital signal
processor (DSP) includes a plurality of functional units, each with a
specific task, coupled with a set of binary encoded instructions that define operations
on the functional units within the processor architecture. The binary encoded
instructions can then be combined to form a program that performs some given task.
Such programs can be executed on the processor architecture or stored in memory for
subsequent execution.

To operate a given program, the functional units within the processor
architecture must be synchronized to ensure correct (e.g., time, order, etc.) execution
of instructions. "Synchronous" systems apply a fixed time step signal (i.e., a clock
signal) to the functional units to ensure synchronized execution. Thus, in related art
synchronous systems, all the functional units require a clock signal. However, not all
functional units need be in operation for a given instruction type. Since the functional
units can be activated even when unnecessary for a given instruction execution,
synchronous systems can be inefficient.

The use of a fixed time clock signal (i.e., a clock cycle) in synchronous systems
also restricts the design of the functional units. Each functional unit must be designed
to perform its worst case operation within the clock cycle even though the worst case
operation may be rare. Worst case operational design reduces performance of
synchronous systems, especially where the typical case operation executes much faster
than the worst case operation. Accordingly, synchronous systems attempt to
reduce the clock cycle to minimize the performance penalties caused by worst case
operation criteria. Reducing the clock cycle below worst case criteria requires
increasingly complex control systems or increasingly complex functional units. These
more complex synchronous systems reduce efficiency in terms of area and power
consumption to meet a given performance criterion such as reduced clock cycles.

Related art self-timed systems, also known as asynchronous systems, remove
many problems associated with the clock signal of synchronous systems. Accordingly,
in asynchronous systems, performance penalties only occur in an actual (rare) worst
case operation. Accordingly, asynchronous systems can be tailored for typical case
performance, which can result in decreased complexity for processor implementations
that achieve the performance requirements. Further, because asynchronous systems
only activate functional units when required for the given instruction type, efficiency
is increased. Thus, asynchronous systems can provide increased efficiency in terms of
integration and power consumption.

By coupling such functional units together to form larger blocks, increasingly
complex functions can be realized. Figure 1 shows two such functional units coupled
via data lines and control lines. A first functional unit 100 is a sender, which passes
data. The second functional unit 102 is a receiver, which receives the data.

Communication between the functional units 100, 102 is achieved by bundling 
data wires 104. Self-timed or asynchronous methodology uses functional units with an 
asynchronous interface protocol for the passing of data and control status, with two 
control wires. A request control wire REQ is controlled by the sender 100 and is 
activated when the sender 100 has placed valid data on the data wires 104. An 
acknowledge control wire ACK is controlled by the receiver 102 and is activated when 
the receiver 102 has consumed the data that was placed on the data wires 104. This 
asynchronous interface protocol is known as a "handshake" because the sender 100 and 
the receiver 102 both communicate with each other to pass the bundled data. 


The asynchronous interface protocol shown in Figure 1 can use various timing
protocols for data communication. One related art protocol is based on a 4-phase
control communication scheme. Figure 2 shows a timing diagram for the 4-phase
control communication scheme.

As shown in Figure 2, the sender 100 indicates that the data on the data wires
104 is valid by driving the request control wire REQ active high. The receiver 102
can now use the data as required. When the receiver 102 no longer requires the data,
it signals back to the sender 100 by driving the acknowledge control wire ACK active
high. The sender 100 can now remove the data from the communication bus, such as
the data wires 104, and prepare the next communication.

In the 4-phase protocol, the control lines must be returned to the initial state.
Accordingly, the sender 100 deactivates the output request by returning the request
control wire REQ low. On the deactivation of the request control wire REQ, the
receiver 102 can deactivate the acknowledge control wire ACK low to indicate to the
sender 100 that the receiver 102 is ready for more data. The sender 100 and the
receiver 102 must follow this strict ordering of events to communicate in the 4-phase
control communication scheme. Beneficially however, there is no upper bound on the
delays between consecutive events.
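
The four ordered events of the 4-phase scheme can be sketched in software. The following is a minimal illustrative model only, not part of the disclosed apparatus; the names Channel and four_phase_transfer are hypothetical, and the sender and receiver are collapsed into one sequential routine, so the unbounded delays between events are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """Bundled-data channel: data wires plus REQ and ACK control wires."""
    req: bool = False
    ack: bool = False
    data: object = None
    trace: list = field(default_factory=list)


def four_phase_transfer(ch, value):
    """Move one value across the channel using the four ordered events."""
    # 1. Sender places valid data on the data wires, then raises REQ.
    ch.data = value
    ch.req = True
    ch.trace.append("REQ high")
    # 2. Receiver consumes the data, then raises ACK.
    received = ch.data
    ch.ack = True
    ch.trace.append("ACK high")
    # 3. Sender may now remove the data and returns REQ low.
    ch.data = None
    ch.req = False
    ch.trace.append("REQ low")
    # 4. Receiver returns ACK low: it is ready for more data.
    ch.ack = False
    ch.trace.append("ACK low")
    return received
```

Both control wires end each transfer in their initial (low) state, which is the return-to-zero property that distinguishes the 4-phase scheme.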

A first-in first-out (FIFO) register or pipeline provides an example of self-timed
systems that couple together a number of functional units. Figure 3 shows such a
self-timed FIFO structure. The functional units can be registers 300a-300c with both an
input interface protocol and an output interface protocol. When empty, each of the
registers 300a-300c can receive data via an input interface 302 for storage. Once data
is stored in the register, the input interface cannot accept more data. In this condition,
the register 300a input has "stalled". The register 300a remains stalled until the register
300a is again empty. However, once the register 300a contains data, the register 300a
can pass the data to the next stage (i.e., register) of the self-timed FIFO structure via
an output interface 304. The register 300a generates an output request when the data
to be output is valid. Once the data has been consumed and the data is no longer
required, the register 300a is then in the empty state. Accordingly, the register 300a
can again receive data using the input interface protocol.

Chaining the registers 300a-300c together by coupling the output interface 304
to the input interface 302 forms the multiple stage FIFO or pipeline. Thus, the output
interface request and acknowledge signals, Rout and Aout, of each register are
respectively coupled to the input interface request and acknowledge signals, Rin and
Ain, of the following register 300a-300c (stage). As shown in Figure 3, data passed
into a FIFO input 306 will be passed from register 300a to register 300c to eventually
emerge at a FIFO output 308. Thus, data ordering is preserved as the data is
sequentially passed along the FIFO. The FIFO structure shown in Figure 3 can use the
4-phase control communication scheme shown in Figure 2 as the input and output
interface protocol.
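
The stall and order-preserving behaviour of the chained registers can be illustrated with a short behavioural sketch. This is an assumption-laden model rather than the circuit of Figure 3: the names Stage, step and run_fifo are hypothetical, the handshake is reduced to a full/empty flag per stage, and discrete steps stand in for self-timed events.

```python
class Stage:
    """One FIFO register: either empty or holding a single value."""

    def __init__(self):
        self.full = False
        self.value = None


def step(stages, output):
    """Advance data by at most one stage, scanning from the output end."""
    last = stages[-1]
    if last.full:
        # The final stage hands its value to the FIFO output and empties.
        output.append(last.value)
        last.full, last.value = False, None
    for i in range(len(stages) - 2, -1, -1):
        src, dst = stages[i], stages[i + 1]
        # A full stage passes data on only when its successor is empty;
        # otherwise it stays stalled.
        if src.full and not dst.full:
            dst.value, dst.full = src.value, True
            src.value, src.full = None, False


def run_fifo(items, depth=3):
    """Push items through a FIFO of the given depth and return the output."""
    stages = [Stage() for _ in range(depth)]
    pending = list(items)
    output = []
    while pending or any(s.full for s in stages):
        step(stages, output)
        # The input interface accepts data only while the first stage is empty.
        if pending and not stages[0].full:
            stages[0].value, stages[0].full = pending.pop(0), True
    return output
```

For example, run_fifo([1, 2, 3, 4]) returns the items in the order in which they were supplied, matching the ordering property described above.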

To implement an asynchronous processor, a more complex array of functional
units is required. Further, to process an instruction, the instruction must be decoded
to activate the functional units required to perform the corresponding instruction task.
However, to execute the instruction, the functional units may have dependencies such
as data dependencies so that the functional units cannot merely operate concurrently
(e.g., within a clock cycle as in synchronous systems). Such dependencies enforce
sequential operations on the functional unit activity to correctly execute each
instruction.

An asynchronous processor is disclosed in "A Fully Asynchronous Digital
Signal Processor Using Self-Timed Circuits" by Jacobs et al., IEEE Journal of
Solid-State Circuits, Volume 25, Number 6, 1990 (hereafter Jacobs). However, the
asynchronous processor in Jacobs merely initiates a preset activation order of all
functional units regardless of the instruction. Accordingly, the asynchronous
processor in Jacobs has disadvantages in that inefficiencies occur because unnecessary
functional units are activated for a given instruction. Further inefficiencies occur
because the ability to exploit potential concurrent operations by functional units that
do not have data dependencies is lacking. In addition, Jacobs cannot individually
control the order and execution of the functional unit activity for each instruction to
increase concurrency and efficiency.

Further, Jacobs cannot implement a real time definition of instructions or a
completely hardware programmable architecture.

The above references are incorporated by reference herein where appropriate 
for appropriate teachings of additional or alternative details, features and/or technical 
background. 

SUMMARY OF THE INVENTION

An object of the present invention is to provide an apparatus and method for 
system control that obviates at least the above-described problems and disadvantages
of the prior art. 

Another object of the present invention is to provide individual control of
functional unit activity for each instruction defined for an asynchronous digital
processor. Another object of the present invention is to provide an asynchronous
digital processor with a high degree of processor concurrency for each defined
instruction.

Another object of the present invention is to provide an apparatus and method
for user modification of a CPU instruction set in real time.

Another object of the present invention is to provide an apparatus and method 
for end-user customization of the ordering of instruction execution in real time. 

Another object of the present invention is to provide an apparatus and method
for reconfiguration of architectures such as VLSI architecture under software control.

Another object of the present invention is to provide an apparatus and method
for controlling operations requiring dependencies for functional units in asynchronous
systems.

Another object of the present invention is to provide a high speed asynchronous
digital processor and method.

Another object of the present invention is to provide a dynamic correlation
between an instruction set and a given processor architecture.

Additional advantages, objects, and features of the invention will be set forth
in part in the description which follows and in part will become apparent to those
having ordinary skill in the art upon examination of the following or may be learned
from practice of the invention. The objects and advantages of the invention may be
realized and attained as particularly pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be described in detail with reference to the following
drawings in which like reference numerals refer to like elements wherein:

Figure 1 is a block diagram showing a self-timed data interface;

Figure 2 is a diagram showing signal waveforms of a four-phase data interface
protocol;

Figure 3 is a block diagram showing a self-timed FIFO structure;

Figure 4 is a block diagram showing a digital processor;

Figure 5 is a diagram showing operations of an instruction pipeline;

Figure 6 is a diagram showing a functional flow of the processor of Figure 4 for
an instruction;

Figure 7 is a diagram showing a functional flow of the processor of Figure 4 for
another instruction;

Figure 8 is a diagram showing a preferred embodiment of a scheduler in a
self-timed processor according to the present invention;

Figure 9 is a diagram showing a preferred embodiment of a scheduler functional
unit controller according to the present invention;

Figure 10 is a diagram showing an asymmetric C-gate;

Figure 11 is a diagram showing a complex asymmetric 4-input C-gate;

Figure 12 is a diagram showing a preferred embodiment of a schedule control
logic according to the present invention;

Figure 13 is a diagram showing a preferred embodiment of a programmable
structure for a decode functional unit; and

Figure 14 is a diagram showing a flowchart of operations of the programmable
decode functional unit of Figure 13.

An exemplary processor 400 architecture is shown in Figure 4. The processor
400 architecture includes functional units, for example, as used in microprocessor,
microcontroller and DSP implementations. Each of the functional units is coupled by
a common resource data bus 416.

A program counter functional unit PC 402 generates an instruction program
address. The PC 402 includes an address stack for holding addresses on subroutine or
interrupt calls. An instruction decoder functional unit 404 controls instruction fetch
and decode. The instruction decoder functional unit 404 contains an instruction
decoder for generating the control of functional units and a status register for holding
current process status. An arithmetic and logic functional unit ALU 406 performs
data and arithmetic operations using an integer arithmetic ALU. The ALU 406 also
contains a data accumulator for storing a result of a specific data or arithmetic
operation.

The processor 400 further includes a multiplier functional unit MULT 408 that
performs data multiplication and an indirect address register functional unit ADDR
410. The ADDR 410 holds indirect data addresses in an address register array. A
Random Access Memory functional unit RAM 414 is used to store data values. A data
RAM control functional unit RAMC 412 controls memory access for data memory
in the RAM 414.

In the processor 400, the functional blocks can operate concurrently. However,
the processor 400 must ensure correct management of the common resource data bus
416 by controlling data and sequence requirements when communications occur
between functional units. Thus, the processor 400 must resolve functional unit
dependencies such as data dependencies between functional units. Preferably, the
architecture of the processor 400 controls communications between functional units.

The processor 400 preferably uses a 3-stage instruction pipeline composed of
instruction fetch, instruction decode and instruction execute cycles. A pipelined
architecture improves performance by allowing more efficient (e.g.,
concurrent) use of the functional units of the processor architecture. As shown in
Figure 5, the 3-stage instruction pipeline allows the pipeline stages to be overlapped,
which increases concurrency and processor performance.

To implement a program on a processor architecture, such as the processor 400,
a set of instructions and corresponding instruction tasks must be defined. During
operations, each instruction is decoded to activate the functional units required to
perform the corresponding instruction task. However, to execute the corresponding
instruction task for each such instruction, individual functional units may have
dependencies such as data dependencies. In this case, the functional units required to
be activated cannot operate concurrently but must be stalled until a particular
condition (e.g., one that resolves the data dependency) is valid.

For example, consider an instruction "ADD*+", which is the instruction to
Add Indirectly Addressed Data to the data accumulator in the ALU 406. As shown
in Figure 6, there is a dependency between functional units required by the processor
400 to perform the ADD*+ instruction. Prior to the execution of the addition within
the ALU 406, the operand required by the ADD*+ instruction must be read from
memory. The ALU 406 must therefore stall until the operand has been read from the
RAM 414 and is valid on the data bus 416. In the processor 400, memory access is
controlled by the RAMC 412. However, prior to the data being read from memory,
the source address of the data must be generated. For instructions using a direct mode
of addressing, the address pointer is usually within the instruction word. Thus, for
direct mode addressing, the address pointer can be passed with other decode
information to the RAMC 412. However, the ADD*+ instruction uses the indirect
mode of addressing. For indirect mode addressing, the address source must be
generated by being read from the address register array of the ADDR 410, and the
RAMC 412 must stall until the address is valid on the data bus 416.

Such dependencies illustrate sequential requirements enforced on the activity
of the functional units of the processor 400 to ensure correct execution of a given
instruction. Thus, Figure 6 shows a flow diagram of activated functional units for the
execution of the ADD*+ instruction by the processor 400. The PC 402 is also
preferably activated by the ADD*+ instruction. The PC 402 is required by the
processor 400 to generate the next instruction address and retrieve the next instruction
from program memory. The concurrent operation of the PC 402 with the ADDR 410
ensures that the instruction pipeline of the processor 400 maintains a constant flow of
instructions.

Figure 7 shows a flow diagram of activated functional units in the processor 400
for another instruction, an Indirect Store Accumulator instruction "STA*+". The
"STA*+" instruction stores the data accumulator to an indirectly addressed data
location. For this example, prior to the activation of the RAMC 412 to store a data
bus value into the data memory RAM 414, the accumulator data must be driven
onto the data bus 416 by the ALU 406 and an indirect address must be generated by
the ADDR 410.

There is no interdependency between the ALU 406 and the ADDR 410 for the
Indirect Store Accumulator instruction STA*+. Thus, the ALU 406 and the ADDR
410 can operate concurrently. However, the activation of the RAMC 412 is stalled
until both the ALU 406 and the ADDR 410 have completed their tasks. Again, in
Figure 7, the PC 402 is shown operating in parallel with the STA*+ instruction
execution because the program instruction address and fetch are required to fill the
instruction pipeline.
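
The two functional flows described for the ADD*+ and STA*+ instructions can be written down as dependency tables and grouped into sets of units that may run concurrently. The sketch below is illustrative only: the table names, the unit labels and the function activation_waves are hypothetical, and the grouping into discrete "waves" is a simplification, since a self-timed processor has no global time step.

```python
# Each unit maps to the set of units whose results it must wait for.
# ADD*+ : ADDR supplies the address, RAMC reads the operand, ALU adds.
ADD_INDIRECT = {"PC": set(), "ADDR": set(), "RAMC": {"ADDR"}, "ALU": {"RAMC"}}
# STA*+ : ALU drives the accumulator and ADDR the address before RAMC stores.
STA_INDIRECT = {"PC": set(), "ALU": set(), "ADDR": set(), "RAMC": {"ALU", "ADDR"}}


def activation_waves(deps):
    """Group units into waves; units in the same wave can run concurrently."""
    done, waves = set(), []
    remaining = dict(deps)
    while remaining:
        # A unit is ready once every unit it depends on has completed.
        ready = sorted(u for u, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cyclic dependency between functional units")
        waves.append(ready)
        done.update(ready)
        for u in ready:
            del remaining[u]
    return waves
```

Under these assumptions the ADD*+ table yields three sequential waves (ADDR and PC, then RAMC, then ALU), whereas the STA*+ table yields only two, because the ALU and the ADDR have no interdependency.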

The above-described instruction examples illustrate the complexity of
management required to order and activate each functional unit within a processor to
execute an instruction. Further, many instructions can be defined in an instruction
set. Thus, to achieve a predetermined level of concurrency for the functional units
activated when performing an instruction set, many different flow diagrams are
required. Similarly, such management control would be required for an asynchronous
system using a set of functional blocks to perform defined operations.

To achieve the control required for implementing an instruction set on a
processor such as the processor 400 architecture, a preferred embodiment of an
apparatus and method for asynchronous system control according to the present
invention will now be described. As shown in Figure 8, a processor 800 architecture
includes the preferred embodiment of a self-timed scheduler. The processor 800
includes exemplary functional units: a PC 802, an ALU 806, a MULT 808, an ADDR
810 and a RAMC 812. The operations of the functional units of the processor 800 are
similar to the operation of the functional units of the processor 400 described above.
Accordingly, a detailed description will be omitted. However, the present invention
is not intended to be so limited. Additional, fewer or alternative functional units used
to implement the processor 800 based on the intended operational requirements and
environment would be within the scope of the present invention.

As shown in Figure 8, a decode functional unit is split into two separate
functional units including a decode instruction functional unit 804 and a scheduler
controller functional unit 814. The decode instruction control 804 controls the
self-timing of the instruction execution phase and contains additional functionality such
as status registers. The scheduler controller 814 decodes the current instruction and
also generates the relevant functional control bundles for each of the functional units.
The scheduler controller 814 thus preferably incorporates functionality similar to
portions of the instruction decoder 404. Figure 8 also shows a scheduler control bus
820 that feeds the relevant schedule control data to each functional unit via a scheduler
functional unit controller 816; functional unit request control lines 822; and functional
unit acknowledge control lines 818.

The self-timed scheduler includes the scheduler controller 814, the scheduler
control bus 820 and the self-timed scheduler functional unit controller 816 in the input
protocol for each functional unit of the processor 800 and corresponding additional
control bits bundled with the control data bus. The additional control bits that
implement the self-timed scheduler functionality are preferably generated, along with
the required functional control bits for each functional unit, within the scheduler
controller unit 814. Control data bundles are preferably generated using a
programmable logic array (PLA) where each instruction mnemonic is input to the
PLA and the appropriate control bundles for each functional unit are generated as
output.

Operations of the preferred embodiment of the self-timed scheduler will now
be described. As shown in Figure 8, for an execution cycle of each instruction in the
processor 800, the decode instruction control 804 generates an active execute request
signal ExecReq to all functional units. At this point, all the control data bundles from
the scheduler controller 814 for each of the functional units are valid because the
control data bundles were generated in the execution cycle for the previous instruction
(e.g., see Figure 6).

On receipt of the execute request signal ExecReq, each functional unit will
activate based on control information in the scheduler controller 814 control data
bundle transmitted via the scheduler control bus 820. The control information can
preferably initiate one of three possible operations in the functional unit. The three
operations include Bypass, Activate (unconditional) and Stall (conditional). However,
the present invention is not intended to be so limited. For example, any set of
operations that accomplish at least the following operations could be used.

In the Bypass operation, a functional unit is not required for the execution of
an instruction task for the current instruction. Thus, the functional unit is bypassed.
The functional unit will immediately generate a corresponding acknowledge signal
(e.g., ALUAck for the ALU 806) signaling its completion. In the Activate
(unconditional) operation, the functional unit begins operations as defined by the
control information from the scheduler controller 814. On completion of its function,
the functional unit generates the corresponding acknowledge signal. In the Stall
(conditional) operation, the functional unit stalls until one or more additional
functional units have completed their respective operations (e.g., a function and the
corresponding acknowledge signal). For the Stall (conditional) operation, the
functional unit has dependencies based on the operations of other functional units.
Accordingly, the functional unit must wait until the completion of the one or more
functional units on which its activation is stalled.
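
The three operations can be illustrated by computing when each functional unit would raise its acknowledge signal after ExecReq goes active. The following is a rough sketch under stated assumptions: the names Op, ack_times, ADD_BUNDLE and LATENCY are hypothetical, the latency figures are invented purely for illustration, and the format of the control data bundle is not taken from the description.

```python
from enum import Enum


class Op(Enum):
    BYPASS = "bypass"      # unit not needed: acknowledge immediately
    ACTIVATE = "activate"  # unit starts unconditionally on ExecReq
    STALL = "stall"        # unit waits for other units to acknowledge


def ack_times(bundle, latency):
    """Return the time at which each unit acknowledges (ExecReq at t=0).

    bundle maps unit -> (Op, set of units it is stalled on).
    """
    acks = {}

    def resolve(unit, visiting=()):
        if unit in acks:
            return acks[unit]
        if unit in visiting:
            raise ValueError(f"units stalled on each other: {unit}")
        op, waits_on = bundle[unit]
        if op is Op.BYPASS:
            t = 0
        elif op is Op.ACTIVATE:
            t = latency[unit]
        else:
            # A stalled unit starts when the last unit it waits on completes.
            start = max((resolve(w, visiting + (unit,)) for w in waits_on),
                        default=0)
            t = start + latency[unit]
        acks[unit] = t
        return t

    for unit in bundle:
        resolve(unit)
    return acks


# Hypothetical bundle for an indirect add; latencies are arbitrary units.
ADD_BUNDLE = {
    "PC": (Op.ACTIVATE, set()),
    "ADDR": (Op.ACTIVATE, set()),
    "RAMC": (Op.STALL, {"ADDR"}),
    "ALU": (Op.STALL, {"RAMC"}),
    "MULT": (Op.BYPASS, set()),
}
LATENCY = {"PC": 2, "ADDR": 1, "RAMC": 3, "ALU": 2, "MULT": 4}
```

In this model the instruction completes when the last acknowledge arrives, so the bypassed multiplier contributes nothing to the execution time even though its assumed latency is the longest.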

The scheduler functional unit controller 816 for each functional unit stalls a
corresponding functional unit until the data dependencies have been resolved. In other
words, the scheduler functional unit controller 816 for each functional unit monitors
the acknowledge signals, for example, by using the acknowledge control wires 818 of
the additional functional units on which its activation is stalled. As discussed above,
on completion of its function, each of the functional units activates the corresponding
acknowledge control wire 818. When the scheduler functional unit controller 816 has
successfully monitored the completion of all the functional units on which its
activation is stalled, the functional unit can be activated, carry out its function, and
acknowledge back to the decode instruction control 804 its completion.

As the processor 800 preferably uses a 4-phase control protocol, the decode
instruction control 804 preferably initiates a recovery cycle when all functional units
have completed their functions as signaled through the acknowledge control lines 818.
The decode instruction control 804 can then prepare for the next instruction execution
cycle.

However, with the 4-phase control protocol, two functional units in the
processor 800 cannot be dependent on each other because a stall in this case can lead
to a deadlock condition. In the deadlock condition, neither functional unit activates
until the other functional unit has completed its function. However, the present
invention is not intended to be limited to disallowing the cross-dependency that is
precluded by the 4-phase control protocol. For example, to prevent deadlock, an
alternative interface protocol or a priority scheme could be used to permit functional
units to be dependent on each other.
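
Because mutually dependent stalls deadlock under the 4-phase protocol, a set of stall dependencies could be checked for cycles before it is used. The sketch below shows one conventional way to do this with a depth-first search; it is an illustration only, and the function name find_deadlock and the dictionary format are hypothetical rather than part of the described scheduler.

```python
def find_deadlock(waits_on):
    """Return a list of units forming a stall cycle, or None if none exists.

    waits_on maps each unit to the units whose acknowledge it is stalled on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {u: WHITE for u in waits_on}
    path = []

    def visit(u):
        colour[u] = GREY
        path.append(u)
        for v in waits_on.get(u, ()):
            state = colour.get(v, WHITE)
            if state == GREY:
                # v is already on the current path: a cycle has closed.
                return path[path.index(v):] + [v]
            if state == WHITE:
                found = visit(v)
                if found:
                    return found
        path.pop()
        colour[u] = BLACK
        return None

    for u in list(waits_on):
        if colour[u] == WHITE:
            found = visit(u)
            if found:
                return found
    return None
```

A schedule in which the RAMC waits on the ALU and the ADDR passes the check, whereas one in which the ALU and the RAMC wait on each other is reported as a cycle.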

The scheduler controller 814 is not a functional unit in the same sense as the
ALU 806 or RAMC 812 functional units. The scheduler controller 814 decodes the
instruction for the next execution phase. Therefore, the scheduler controller 814
operates in parallel with the current instruction execution cycle and outputs data that
controls all the functional units. Thus, the scheduler controller 814 cannot be updated
until all functional units have completed and the previously executed control data
bundle is no longer required. A 4-phase control protocol can be used for the
self-timed scheduler operations because the decode instruction control 804 sets the
ExecReq signal low upon entering the recovery phase. The low ExecReq signal
indicates to the scheduler controller 814 that its preceding control data bundle is no
longer required and can be updated for the next execution cycle. Accordingly, the
scheduler controller 814 returns an acknowledge signal PLAAck low to indicate that
the new control data bundle is now valid and the next execution cycle can be activated.

Thus, the preferred embodiment of the self-timed scheduler can dynamically
define execution ordering, concurrency and sequentiality of all the functional units
under its control. Further, additional instructions can be subsequently added to the
instruction set, for example, by only implementing the added instruction mnemonic
in the PLA to generate a control data bundle. Similarly, subsequent functional units
can easily be incorporated into the processor architecture by using a protocol such as
a scheduler functional unit controller.

A preferred embodiment of a scheduler functional unit control circuit will now
be described. However, the present invention is not intended to be limited to this
because alternative interface protocols could be used. Figure 9 shows a circuit diagram
of such a scheduler functional unit control circuit 900 that can be used as the scheduler
functional unit controller 816 in the processor 800. Accordingly, the scheduler
functional unit control circuit will be described based on the 4-phase control protocol.

The scheduler functional unit control circuit 900 includes two asymmetric
C-gates 902, 904, a complex 4-input C-gate 906, and two NOR gates 908, 910. C-gates
can operate as an AND function for self-timed events. Many different
implementations of C-gates exist; however, all perform the basic functionality that
one input condition instantiates a high on the C-gate output, a different input
condition instantiates a low on the C-gate output and the remaining input conditions
of the C-gate input pins retain a previous set output. Figure 10 shows an exemplary
4-input asymmetric C-gate element 1000, which is a special case of the standard Muller
C-gate. This form of C-gate is known as an asymmetric C-gate because all input pins
affect the setting of the gate output high; however, only one input pin affects the
setting of the gate output low. The asymmetric C-gate 1000 shown in Figure 10 has
the following functionality:

IF InB AND InN1 AND InN2 AND InN3 THEN Out → High;
ELSE IF /InB THEN Out → Low; and
ELSE no change on Out.

The asymmetric C-gate 1000 is preferably used as the C-gates 902, 904 in the scheduler
functional unit controller 900.

As shown in Figure 11, the complex asymmetric C-gate 906 has the following
functionality:

IF InB AND (InO OR (InA1 AND InA2)) THEN Out → High;
ELSE IF /InB THEN Out → Low; and
ELSE no change on Out.
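
The set, reset and hold behaviour stated for the asymmetric C-gate of Figure 10 and the complex C-gate of Figure 11 can be modelled as small state-holding objects. This is a behavioural sketch only; the class and parameter names are hypothetical, and no gate delays or transistor-level details are represented.

```python
class AsymmetricCGate:
    """4-input asymmetric C-gate: all inputs set, only InB resets."""

    def __init__(self, out=False):
        self.out = out

    def update(self, in_b, in_n1, in_n2, in_n3):
        if in_b and in_n1 and in_n2 and in_n3:
            self.out = True    # Out -> High
        elif not in_b:
            self.out = False   # /InB: Out -> Low
        # Any other input combination leaves Out unchanged.
        return self.out


class ComplexCGate:
    """Complex asymmetric C-gate: InB AND (InO OR (InA1 AND InA2)) sets."""

    def __init__(self, out=False):
        self.out = out

    def update(self, in_b, in_o, in_a1, in_a2):
        if in_b and (in_o or (in_a1 and in_a2)):
            self.out = True    # Out -> High
        elif not in_b:
            self.out = False   # /InB: Out -> Low
        # Any other input combination leaves Out unchanged.
        return self.out
```

Once either gate has been set, it holds its output high while InB remains high, even if the other inputs fall, and releases only when InB goes low.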

The scheduler functional unit control circuit 900 includes: request,
acknowledge, start, finish, and reset interface signals. The request signal Req requests
input (e.g., from the decode instruction control 804) to initiate functional unit activity.
The acknowledge signal Ack is output on completion of the functional unit activity
(e.g., bypass). The start signal Start is the functional unit internal activate signal. The
finish signal Finish is the functional unit internal completion signal. The reset signal
Ra is a circuit reset signal that initializes the state of control logic. As shown in Figure
9, the reset signal Ra is set active high.

Figure 9 also shows an additional dependency control unit 912. Inputs to the
dependency control unit 912 preferably include the control data bundle and the
monitored functional unit acknowledge signals.

Operations of the scheduler functional unit controller 900 will now be
described. After initialization, the scheduler functional unit controller 900
is activated by a positive-going input request signal, Req. The request signal
Req will then transition the output of one of the two C-gates 902, 904 high,
dependent on the control signals Bypass and Execute. If the Bypass signal is
high, then the C-gate 904 will transition its output high. The high output of
the C-gate 904 indicates that the corresponding functional unit (not shown)
will be bypassed. The C-gate 904 then activates a signal ByPR, which in turn
transitions the output of the C-gate 906 high to generate the acknowledge
signal, Ack. The scheduler functional unit controller 900 then must go through
a recovery phase with the input request signal Req going low, which in turn
resets the output acknowledge signal Ack low.

If, on an active request, the control signal Execute is high, the output of the
C-gate 902 will transition high. The high output of the C-gate 902 activates
the corresponding functional unit internal signal Start to initiate its
activity. Completion of the corresponding functional unit activity is signaled
by the internal signal Finish transitioning active high. The high internal
signal Finish in turn will transition the output of the complex C-gate 906
high, which generates the output acknowledge signal Ack.

The scheduler functional unit controller 900 preferably has additional
functionality that adds a safety feature to the circuit. The signals Finish and
ByPR are fed through the 2-input NOR gate 908 to prevent incorrect output
signal transitions of the inactivated C-gates 902, 904. The NOR gate 908 allows
the control signals, Execute and Bypass, to become undefined after an active
acknowledge signal Ack has been generated. The feedback loop disables the
activation (active high output signal) of either of the C-gates 902, 904 until
both gates have returned to their initial low state, which indicates the
completion of their 4-phase protocol cycle. Note that the control signals
Execute and Bypass must never both be active (high) when an active high request
occurs. If they are, both C-gates 902, 904 will activate, causing unpredictable
behavior and possibly deadlock.
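
The bypass and execute cycles described above can be traced with a small
gate-level simulation. This is a sketch under stated assumptions, not the
circuit of Figure 9: the assignment of signals to C-gate pins is inferred from
the text, the reset input is modelled only as an inhibit on NOR gate 910, and
the fixed-point loop in `settle` is a modelling convenience with no counterpart
in self-timed hardware. The names `c_gate`, `UnitController` and `settle` are
hypothetical.

```python
def c_gate(prev, in_b, set_cond):
    """Asymmetric C-gate: set on in_b AND set_cond, reset on NOT in_b, else hold."""
    if in_b and set_cond:
        return True
    if not in_b:
        return False
    return prev


class UnitController:
    """Simplified model of the scheduler functional unit controller 900."""

    def __init__(self):
        self.start = False  # output of C-gate 902
        self.bypr = False   # output of C-gate 904
        self.ack = False    # output of C-gate 906

    def settle(self, req, bypass, execute, finish, reset=False):
        """Apply input levels and iterate until the gate outputs are stable."""
        for _ in range(8):
            nor908 = not (finish or self.bypr)  # blocks 902/904 after completion
            nor910 = not (reset or self.ack)    # blocks 902/904 while Ack is high
            enable = nor908 and nor910
            start = c_gate(self.start, req, execute and enable)
            bypr = c_gate(self.bypr, req, bypass and enable)
            ack = c_gate(self.ack, req, bypr or (start and finish))
            if (start, bypr, ack) == (self.start, self.bypr, self.ack):
                break
            self.start, self.bypr, self.ack = start, bypr, ack
        return self.start, self.bypr, self.ack


ctrl = UnitController()

# Bypass cycle: Req rises with Bypass high, then Req falls (recovery phase).
bypass_cycle = [
    ctrl.settle(req=True, bypass=True, execute=False, finish=False),
    ctrl.settle(req=False, bypass=False, execute=False, finish=False),
]

# Execute cycle: Req rises with Execute high, the unit finishes, Req falls.
execute_cycle = [
    ctrl.settle(req=True, bypass=False, execute=True, finish=False),
    ctrl.settle(req=True, bypass=False, execute=True, finish=True),
    ctrl.settle(req=False, bypass=False, execute=False, finish=True),
]
```

Each tuple is (Start, ByPR, Ack). In both cycles Ack rises only after the bypass
or finish event and falls only after Req is withdrawn, which is the 4-phase
behavior the text describes.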

The control signals Bypass and Execute are preferably generated by the
self-timed scheduler from the control data bundle and the monitored functional
unit acknowledge signals. Figure 12 shows an exemplary control circuit 1200 for
a functional unit with dependencies on two other functional units, A and B (not
shown). The control circuit 1200, for example, can be used as the dependency
control unit 912 of the scheduler functional unit controller 900.

For the control circuit 1200, a 3-bit schedule control data bundle is required.
The 3-bit schedule control data bundle is shown as the data bundle SchCtrl[2:0]
in Figure 12. Exemplary bit definitions of the data bundle SchCtrl[2:0] are as
follows. The bit signal SchCtrl[0] being high activates the corresponding
functional unit. If the bit signal SchCtrl[0] is low, the corresponding
functional unit is bypassed. If the bit signal SchCtrl[1] is high, the
controller waits for the functional unit A acknowledge signal, UnitAAck, to go
high before activating the corresponding functional unit. If the SchCtrl[1] bit
signal is low, there is no wait before activation. The SchCtrl[2] bit signal
operates similarly to SchCtrl[1], except in relation to the functional unit B.

Therefore, if SchCtrl[0] is low, the Bypass signal is set high and the
functional unit is not activated. Note that when the Bypass signal is set high,
the Execute signal is preferably forced low to ensure that both signals do not
activate at the same time. If SchCtrl[0] is high, the Bypass signal is
disabled. However, the activation of the Execute signal is dependent on the
remaining bits of the schedule control data bundle. If all remaining bits
(e.g., SchCtrl[2:1]) are low, then the functional unit has no data dependencies
and the Execute signal will be set high to activate the functional unit. If any
of the remaining control bits of the schedule control data bundle are high,
then data dependencies exist and the Execute signal remains low to stall
activation of the functional unit. If the control bits SchCtrl[0] and
SchCtrl[1] are both high, the Execute signal will stay low until the UnitAAck
signal goes high, with the resolution of the dependency in functional unit A
allowing the stalled functional unit to proceed. If more than one dependency
bit of the control data bundle is set (e.g., SchCtrl[1] and SchCtrl[2] both
high), the Execute signal will remain low until each selected dependency is
resolved.
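
The bit definitions above amount to a small combinational function. The sketch
below restates them in Python as an aid to reading; the function name
`dependency_control` and the argument `unit_b_ack` (the unit B counterpart of
UnitAAck) are assumed names, and the model ignores the set-up timing of the
real circuit.

```python
def dependency_control(sch_ctrl, unit_a_ack, unit_b_ack):
    """Return (bypass, execute) for a 3-bit SchCtrl[2:0] bundle.

    sch_ctrl is an integer whose bit 0 is SchCtrl[0], bit 1 is SchCtrl[1]
    and bit 2 is SchCtrl[2].
    """
    activate = bool(sch_ctrl & 0b001)  # SchCtrl[0]: activate (1) or bypass (0)
    wait_a = bool(sch_ctrl & 0b010)    # SchCtrl[1]: wait for unit A acknowledge
    wait_b = bool(sch_ctrl & 0b100)    # SchCtrl[2]: wait for unit B acknowledge

    bypass = not activate
    # A dependency is resolved if it is not selected or its acknowledge is high.
    resolved = (not wait_a or unit_a_ack) and (not wait_b or unit_b_ack)
    # Execute is forced low whenever Bypass is high.
    execute = activate and resolved
    return bypass, execute


# A unit that depends on both A and B stalls until both have acknowledged.
stalled = dependency_control(0b111, unit_a_ack=True, unit_b_ack=False)
released = dependency_control(0b111, unit_a_ack=True, unit_b_ack=True)
```

Because `execute` is gated by `activate`, Bypass and Execute can never both be
high in this model, which is the condition the controller 900 requires.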

Although the control circuit 1200 monitors two functional units, A and B, the
present invention is not intended to be so limited. Accordingly, the self-timed
scheduler can monitor one, three or more dependencies. However, a predetermined
amount of set-up time is required to guarantee that the Bypass and Execute
signals are in a valid state before the active high request signal Req is input
to the schedule control circuit 1200.

As shown in Figure 8, a decode functional unit is split into two separate
functional units including the decode instruction functional unit 804 and the
scheduler controller functional unit 814. However, the function and
implementation of decoding instructions and generating the relevant functional
control bundles for each of the functional units can be implemented as a user
programmable structure. Thus, for example, the scheduler controller functional
unit 814 or portions of the instruction decoder 404 can be implemented as the
user programmable structure.

For example, an instruction set can be specified by the decode information.
Thus, the decode information, which maps between an instruction (e.g., an
instruction bit pattern) and a set of functional blocks to be activated, can be
modified by implementing it in a user programmable structure. End users can
specify a predefined instruction set and any additional instructions desired,
or an end-user defined instruction set (as long as the desired instructions can
be mapped onto the hardware). In this case, the end-user is considered to be
someone other than the manufacturer. Further, the same hardware (e.g., digital
processor, chip, or the like) can support multiple, differing instruction sets,
which can be changed by simply reloading the user programmable structure.
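
The idea of decode information as a reloadable mapping can be illustrated with
a short Python sketch. Everything named here is hypothetical: the unit names,
the bit patterns, the two example instruction sets and the
`ProgrammableDecoder` class are assumptions made for illustration, and the
table stores only an ordered list of units rather than full schedule and
control bundles.

```python
AVAILABLE_UNITS = {"PC", "ALU", "MULT", "IAR", "DATA"}  # hypothetical unit names


class ProgrammableDecoder:
    """Maps an instruction bit pattern to the functional units it activates."""

    def __init__(self, available_units):
        self.available_units = set(available_units)
        self.table = {}

    def load(self, instruction_set):
        """Replace the whole decode table, rejecting unmappable instructions."""
        for pattern, units in instruction_set.items():
            missing = set(units) - self.available_units
            if missing:
                raise ValueError(
                    f"pattern {pattern:#04x} needs missing units: {sorted(missing)}"
                )
        self.table = dict(instruction_set)

    def decode(self, pattern):
        """Return the ordered units to activate for one instruction."""
        return self.table[pattern]


# Two hypothetical instruction sets that reuse the same bit pattern 0x01.
SET_A = {0x01: ("ALU",), 0x02: ("MULT", "ALU")}
SET_B = {0x01: ("MULT", "ALU", "DATA")}

decoder = ProgrammableDecoder(AVAILABLE_UNITS)
decoder.load(SET_A)
first = decoder.decode(0x01)
decoder.load(SET_B)           # same hardware, different instruction set
second = decoder.decode(0x01)
```

The validation in `load` reflects the stated constraint that a user-defined
instruction is acceptable only if it can be mapped onto the units that exist.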

A preferred embodiment of a user programmable structure for a decode functional
unit according to the present invention will now be described. As shown in
Figure 13, the digital processor 800 includes a user programmable scheduler
controller 1302. The self-timed processor 800 was described above, and
accordingly, a detailed description is omitted.

The user programmable scheduler controller 1302 can be implemented using
registers, RAM, ROM, a combination of registers, RAM and ROM, or the like.
Using RAM or registers allows the user programmable scheduler controller 1302
to load the decode information from an external source as shown in Figure 13.
As described above, the decode information specifies which functional units are
activated and sequences the activated functional units in the desired order for
each instruction to be executed. Therefore, the user programmable scheduler
controller 1302 in the self-timed digital processor implements an architecture
that is completely programmable after manufacture, as long as the functional
units necessary to perform the intended operation exist in the digital
processor.

A ROM based implementation of the user programmable scheduler controller 1302
implements the hardware or hardwired function of the decode information, which
includes at least a decode table, into fixed software or ROM. As discussed
above, the decode table translates the instruction mnemonic into the functional
unit schedule and control information. In the ROM implementation, the decode
table would be fixed upon encoding into the ROM. Alternatively, limited
modification could be incorporated using flash ROM technology, EEPROM or the
like.

A RAM based implementation of the user programmable scheduler controller 1302
existing in an external source can be boot loaded during initialization of the
processor 800 to copy the decode table into internal memory. The copied
instruction set could then be used by the signal processor, main processor or
the like of the digital processor 800. The external RAM source can be an
external memory, internal memory or another source of programmable memory on a
chip that includes the processor 800. Further, the external RAM can include
multiple various predetermined instruction sets. In this case, one of the
multiple instruction sets could be selected by a user during or before
initialization. Alternatively, one of the multiple instruction sets can be
selected based on existing or detected conditions. For example, the selected
one of the instruction sets could be selected based on a setting of external
pins. In particular, upon error detection, a specific set of error correction
instructions could be loaded. Further, the selected instruction set could be
loaded either before initialization, during initialization or during subsequent
operations.

Generally, for ROM and RAM based implementations, the instruction set is loaded
sequentially into memory. Then, address bits in the instruction or the
instruction mnemonic are used to access the desired memory location.
Accordingly, as the instruction set increases in size, more bits are required
in the instruction and more addresses in memory. For example, an 8 bit
instruction can represent up to 256 separate instructions and needs 2^8 or 256
memory locations. Similarly, a 16 bit instruction needs 2^16 or 65536 memory
locations.

A register based implementation of the user programmable scheduler controller
1302 generally uses an array of registers. The register based implementation
uses the register array to implement a decode table or a look-up table
representation of the functional unit schedule and control information.
Beneficially, the register based implementation allows more flexibility because
the instructions are not translated into sequential locations in memory. Thus,
a sparsely populated memory can be represented.

The user programmable scheduler controller 1302 has been described as various
implementations of ROM, RAM or registers. However, the present invention is not
intended to be limited to this. For example, any user programmable structure
can be used for the user programmable scheduler controller 1302.
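
The contrast drawn above between a sequentially addressed ROM or RAM table and
a register array can be made concrete with a short Python sketch. The function
and class names are hypothetical, the stored control information is reduced to
a descriptive string, and a dictionary merely stands in for the register array.

```python
def dense_locations(instruction_bits):
    """Memory locations needed when every bit pattern has its own address."""
    return 2 ** instruction_bits


class RegisterDecodeTable:
    """Look-up table in which only populated bit patterns occupy an entry."""

    def __init__(self):
        self._registers = {}

    def store(self, pattern, control_info):
        self._registers[pattern] = control_info

    def lookup(self, pattern):
        # Unpopulated patterns return None rather than consuming storage.
        return self._registers.get(pattern)

    def __len__(self):
        return len(self._registers)


# Two widely separated 16-bit patterns need two entries, not 65536 locations.
table = RegisterDecodeTable()
table.store(0x0001, "ALU only")
table.store(0x8000, "MULT then ALU")
dense_16 = dense_locations(16)
used = len(table)
```

The sparse form trades a direct address decode for a pattern match, which is
the flexibility the register based implementation is said to provide.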

Operations of the user programmable scheduler controller 1302 will now be
described with respect to Figure 14. As shown in Figure 14, the process starts
in step 1400, from which control continues to step 1402.

In step 1402, the decode table is loaded in a programmable unit (e.g., ROM,
RAM, registers or the like) such as the user programmable scheduler controller
1302. In step 1402, multiple varying versions of the decode table can be
loaded. The decode table can be loaded one instruction at a time or in blocks
of multiple instructions. From step 1402, control continues to step 1404. In
step 1404, a check is made for initialization. If the determination in step
1404 is affirmative, control continues to step 1406. Alternatively, if
initialization is not detected, control returns to step 1404.

In step 1406, initialization begins. From step 1406, control continues to step
1408. In step 1408, a check is made for a decode condition. If the decode
condition is not set in step 1408, control continues to step 1410. In step
1410, a default decode table is loaded from the programmable unit and
initialization is completed. However, if a decode condition is set in step
1408, control continues to step 1412. In step 1412, a selected decode table is
loaded from the programmable unit and initialization is completed. For example,
the initialization can take the form of compiling the source code of a main
processor of a data processing apparatus. In the self-timed processor 800, the
functional unit schedule and control information represented in the decode
table configures the processor architecture. For example, if an error decode
condition is detected in step 1408, an error instruction set could be loaded
from the programmable unit into the decode table. Such an error instruction set
can be used to identify, track and correct a detected error in the digital
processor, its peripheral units or in an executing application program. From
steps 1410 and 1412, control continues to step 1414.

In step 1414, the main processor executes instructions based on the instruction
set implemented by the user programmable decode information to run an
application program. From step 1414, control continues to step 1416. In step
1416, a check is made to determine if the user wants to modify the decode
information during execution. If the determination in step 1416 is affirmative,
control continues to step 1418, where the user modifies the decode table using
the programmable unit (e.g., RAM or register). Thus, in step 1418, the decode
table is modified in real time, for example, without recompiling the processor
or application code. From step 1418, control returns to step 1416. If the
determination in step 1416 is negative, control continues to step 1420. In step
1420, the process is completed.
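
The control flow of steps 1400 through 1420 can be summarized as a short
procedure. The Python sketch below is only a reading aid: the function name,
the dictionary representation of decode tables, the `"default"` key, and the
assumption that initialization is detected on the first check in step 1404 are
all illustrative choices that do not come from Figure 14.

```python
def run_figure14(tables, decode_condition=None, modifications=()):
    """Walk the steps of Figure 14 and return (visited_steps, active_table).

    tables: dict of named decode tables; must contain a "default" entry.
    decode_condition: name of the table to select in step 1412, or None.
    modifications: (pattern, info) pairs the user applies during execution.
    """
    steps = [1400]

    programmable_unit = {name: dict(t) for name, t in tables.items()}
    steps.append(1402)                 # decode tables loaded

    steps.append(1404)                 # initialization check (assumed to pass)
    steps.append(1406)                 # initialization begins

    steps.append(1408)                 # decode condition check
    if decode_condition is None:
        active = dict(programmable_unit["default"])
        steps.append(1410)             # default decode table loaded
    else:
        active = dict(programmable_unit[decode_condition])
        steps.append(1412)             # selected decode table loaded

    steps.append(1414)                 # application program executes

    pending = list(modifications)
    while True:
        steps.append(1416)             # does the user want to modify?
        if not pending:
            break
        pattern, info = pending.pop(0)
        active[pattern] = info         # real-time change, no recompilation
        steps.append(1418)

    steps.append(1420)                 # process completed
    return steps, active


tables = {"default": {0x01: "ALU"}, "error": {0x01: "trace error"}}
steps, active = run_figure14(tables, "error", [(0x02, "MULT then ALU")])
```

Selecting the `"error"` table here corresponds to the error decode condition
example given for step 1408.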

To ease the process of loading new instruction decode information, a default
set of universal instructions could be defined and hard wired within the decode
information of the programmable unit (e.g., programmable memory such as ROM).
Preferably, the hard wired instructions include instruction types such as
initialization instructions (e.g., "Load", "Transfer" or the like) to enable
easy access to the decode table in the programmable unit. Such instructions are
also known as "house-keeping" instructions and are usually always valid on the
processor. In contrast, instructions that are valid for only one operating mode
of the processor (e.g., error instructions) can be delegated to only one of
multiple instruction sets.

To keep the size of the decode programmable memory to a minimum, the allowable
input bit patterns can be limited to support a predetermined set of
instructions in one operating mode. Thus, the instruction set could be limited
to 256 different instructions per instruction set. Also for efficiency reasons,
the positions of any fields in the instruction bit pattern are fixed. Fixed
fields can reduce cost and simplify the logic required for encoding and
decoding. Thus, the instruction bit pattern can be defined as the least
significant 7 bits of the instruction word.

Further, based on the user programmable scheduler controller 1302, decode table
"program" macros, which can manipulate program particular hardware
configurations for differing functions, can be implemented. The program macros
could be updated within the user programmable scheduler controller 1302 in
real time under software control. Thus, the user programmable structure permits
multiple levels of flexibility to the end-user. Further, the user programmable
structure permits different time points of flexibility in the development of
the corresponding hardware (e.g., a digital processor).

As described above, the user programmable scheduler controller 1302 can be used
to generate a universal architecture with various multiple functional units
using instruction bits, external control pins or a user programmable register
or memory to provide the scheduler control. Such an implementation allows a
multi-threaded compiler for the processor 800 to generate instructions that
would run on a user definable parallel architecture. Alternatively, the use of
such a scheduler would allow the construction of hardware programmable chips
whose functionality could be altered by changing scheduler control bits within
a register, memory or on external pins to reorder the execution of data among
functional units. The preferred embodiment of the user programmable scheduler
controller 1302 and method for using same in a self-timed data processing
apparatus supports a completely hardware programmable architecture. Thus, the
user programmable structure is not intended to be limited to the processor 800.

Supporting such an architecture within a synchronous environment requires an
increase in complexity. The concept of scheduling within a synchronous paradigm
is more complex because of the further requirement of referencing to phases of
a clock signal or clock cycle. As discussed above, the clock also forces a more
rigid design environment for the functional units and must be tailored to
handle the worst case operating case of the programmable architecture.

Although the scheduler has been described in reference to a particular design,
the concept is applicable to the control of general self-timed circuits with
inter-communicating sub-blocks and not just processor architectures as has been
so far described. Further, the self-timed scheduler was implemented in a
self-timed DSP design using a 4-phase control scheme. However, alternative
self-timed interface protocols other than the 4-phase control scheme can be
used.

The preferred embodiments of the present invention allow for user programmable
circuit control. A software designer using a hardware architecture implementing
the user programmable circuit control can define a new ordering of operations
to optimize generic hardware for a specified application or software
application program. Beneficially, software for embedded systems can tailor an
architecture for a given application to reduce system cost or improve
performance without requiring new custom hardware.

As described above, the preferred embodiments of the apparatus and method to
control asynchronous systems according to the present invention can be
configured to implement any instruction type onto the functional units
provided, which enables functional unit execution to be in any required
sequential ordering, concurrent ordering or not activated at all.

As described above, the preferred embodiments according to the present
invention implement efficient function or work scheduling in a generalized
architecture (e.g., a processor architecture), and in particular, for highly
configurable architectures (e.g., a processor architecture in which new
instructions and hardware can be added). Further, the preferred embodiments of
the self-timed scheduler according to the present invention can simplify the
control structures for digital processors. In addition, the user programmable
structure of the decode information permits run-time definition of instructions
and a completely hardware programmable architecture. The preferred embodiments
according to the present invention use self-timing and inter-block
communication to implement the preferred embodiments of the apparatus and
method for control of asynchronous systems.

The foregoing embodiments are merely exemplary and are not to be construed as
limiting the present invention. The present teaching can be readily applied to
other types of apparatuses. The description of the present invention is
intended to be illustrative, and not to limit the scope of the claims. Many
alternatives, modifications, and variations will be apparent to those skilled
in the art.

WHAT IS CLAIMED IS:

1. A data processing apparatus, comprising:

a plurality of functional units 802-812, each functional unit performing a set
of prescribed operations;

a scheduler controller 814 that decodes at least one current instruction to
generate a functional unit schedule and control information; and

a communications device 820 coupling the functional units and the scheduler
controller.

2. A data processing apparatus, comprising:

a plurality of functional units 802-812, each functional unit performing a set
of prescribed operations;

a data bus 820 coupling the functional units;

an asynchronous controller that implements variable execution times in at least
the functional unit schedule; and

a scheduler controller 814 that decodes at least one current instruction to
generate a functional unit schedule and control information.

3. A data processing apparatus, comprising:

a plurality of functional units 802-812, each functional unit performing a set
of prescribed operations;

a programmable circuit 1302 that is capable of modifying an entire instruction;

a scheduler controller 814 that decodes a current instruction to perform a
corresponding instruction task using the plurality of functional units; and

a communications device 820 coupling the functional units, the programmable
circuit and the scheduler controller.

4. The data processing apparatus of claims 1, 2 or 3, wherein the scheduler
controller 814 executes a first current instruction by causing each of the
plurality of functional units 802-812 to be one of bypassed, operated
concurrently with other of the plurality of functional units and operated
sequentially after said other of the plurality of functional units.

5. The data processing apparatus of claims 1 or 2, wherein an asynchronous
control structure 816, 818, 822 implements variable execution times in the
functional unit schedule.

6. The data processing apparatus of claims 1 or 2, wherein the scheduler 
controller 814 executes an instruction task by decoding a first current instruction and 
implementing an ordered operation on a subset of the plurality of functional units, 
wherein different forms of the first current instruction result in at least one of different 
functional unit schedules and different control information. 

7. The data processing apparatus of claims 1 or 2, wherein the scheduler
controller 814 further comprises:

a scheduler decoder that decodes the first current instruction to generate the
functional unit schedule and the control information, wherein the scheduler
decoder includes an implementation of a logic table; and

a plurality of scheduler functional unit controllers 816, wherein each of the
scheduler functional unit controllers controls self-timing of one of the
plurality of functional units.

8. The data processing apparatus of claim 7, wherein said each of the scheduler
functional unit controllers further comprises:

a dependency monitor unit 1200 that generates a bypass signal and an execute
signal, wherein the dependency monitor unit further comprises,

a bypass circuit that receives a dependency signal and outputs the bypass
signal for a corresponding functional unit,

a plurality of dependency circuits, wherein each of the plurality of dependency
circuits receives the dependency signal and a corresponding acknowledge signal,
and

a control circuit that receives output signals from the plurality of dependency
circuits and the bypass signal to output the execute signal for the
corresponding functional unit; and

an operation control unit that receives the bypass signal, the execute signal
and a request signal and outputs an acknowledge signal, wherein the operation
control unit further comprises,

first and second logic-gates 908, 910, wherein the second logic-gate receives a
reset signal and the acknowledge signal,

a first C-gate 904 that receives the request signal, the bypass signal and
output signals of the first logic-gates and outputs a corresponding functional
unit start signal,

a second C-gate 902 that receives the request signal, the execute signal and
the output signals of the first and second logic-gates and outputs an
intermediate signal, and

a third C-gate 906 that receives a finish signal from the corresponding
functional unit, the request signal and the output signals of the first and
second C-gates and outputs the acknowledge signal, wherein the first logic-gate
receives the finish signal and the intermediate signal.

9. The data processing apparatus of claims 1 or 2, further comprising a
programmable circuit 1302 that is capable of modifying an entire instruction.

10. The data processing apparatus of claims 3 or 9, wherein the instruction is
modified real-time, and wherein the programmable circuit 1302 selects one of a
plurality of instruction sets and loads the selected instruction set during
initialization of the data processing apparatus.

11. The data processing apparatus of claims 3 or 10, wherein the 
programmable circuit includes at least one of decode programmable memory and an 
array of registers, wherein the array of registers are at least one of user modified and 
modified by a source external to the data processing apparatus. 

12. The data processing apparatus of claims 1, 2 or 3, wherein the plurality of
functional units includes at least one of a program counter unit 802, an
instruction decoder unit, an arithmetic and logic unit 806, a multiplier unit
808, an indirect address register unit 810 and a data storage unit 812, and
wherein the scheduler controller uses a three-stage instruction pipeline and a
four phase communication protocol.

13. A data processing apparatus substantially as hereinbefore described with
reference to and/or as shown in Figures 4-14 of the accompanying drawings.